    Comparative Analyses of De Novo Transcriptome Assembly Pipelines for Diploid Wheat

    Gene expression and transcriptome analysis are currently among the main focuses of research for a great number of scientists. However, the assembly of raw sequence data to obtain a draft transcriptome of an organism is a complex multi-stage process usually composed of pre-processing, assembling, and post-processing. Each of these stages includes multiple steps such as data cleaning, error correction and assembly validation. Different combinations of steps, as well as different computational methods for the same step, generate transcriptome assemblies with different accuracy. Thus, using a combination that generates more accurate assemblies is crucial for any novel biological discoveries. Implementing accurate transcriptome assembly requires extensive knowledge of different algorithms, bioinformatics tools and software that can be used in an analysis pipeline. Many pipelines can be represented as automated scalable scientific workflows that can be run simultaneously on powerful distributed computational resources, such as Campus Clusters, Grids, and Clouds, and speed up the analyses. In this thesis, we 1) compared and optimized de novo transcriptome assembly pipelines for diploid wheat; 2) investigated the impact of a few key parameters for generating accurate transcriptome assemblies, such as digital normalization and error correction methods, de novo assemblers and k-mer length strategies; 3) built a distributed and scalable scientific workflow for blast2cap3, a step from the transcriptome assembly pipeline for protein-guided assembly, using the Pegasus Workflow Management System (WMS); and 4) deployed and examined the scientific workflow for blast2cap3 on two different computational platforms. Based on the analysis performed in this thesis, we conclude that the best transcriptome assembly is produced when the error correction method is used with Velvet Oases and the “multi-k” strategy.
Moreover, the performed experiments show that the Pegasus WMS implementation of blast2cap3 reduces the running time by more than 95% compared to its current serial implementation. The results presented in this thesis provide valuable insight for designing a good de novo transcriptome assembly pipeline and show the importance of using scientific workflows for executing computationally demanding pipelines. Advisor: Jitender S. Deogu

    Systems-Based Approach for Optimization of Assembly-Free Bacterial MLST Mapping

    Epidemiological surveillance of bacterial pathogens requires real-time data analysis with a fast turnaround, while aiming at generating two main outcomes: (1) species-level identification and (2) variant mapping at different levels of genotypic resolution for population-based tracking and surveillance, in addition to predicting traits such as antimicrobial resistance (AMR). Multilocus sequence typing (MLST) aids this process by identifying sequence types (ST) based on seven ubiquitous genome-scattered loci. In this paper, we selected one assembly-dependent and one assembly-free method for ST mapping and applied them with the default settings and ST schemes they are distributed with, and systematically assessed their accuracy and scalability across a wide array of phylogenetically divergent Public Health-relevant bacterial pathogens with available MLST databases. Our data show that the optimal k-mer length for stringMLST is species-specific and that genome-intrinsic and -extrinsic features can affect the performance and accuracy of the program. Although suitable parameters could be identified for most organisms, there were instances where this program may not be directly deployable in its current format. Next, we integrated stringMLST into our freely available and scalable hierarchical-based population genomics platform, ProkEvo, and further demonstrated how the implementation facilitates automated, reproducible bacterial population analysis
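As a rough illustration of what ST mapping means in the abstract above, the sketch below looks up a sequence type from the allele numbers called at seven loci. The locus names, allele numbers, profile table, and the `assign_st` helper are all hypothetical and are not taken from stringMLST, ProkEvo, or any real MLST scheme.

```python
from typing import Dict, Optional, Tuple

# Hypothetical seven-locus scheme: the locus names are placeholders.
LOCI: Tuple[str, ...] = ("locA", "locB", "locC", "locD", "locE", "locF", "locG")

# Hypothetical profile table: ordered allele numbers -> sequence type (ST).
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (2, 1, 3, 1, 1, 4, 1): 2,
}


def assign_st(allele_calls: Dict[str, int]) -> Optional[int]:
    """Return the ST for a full set of allele calls, or None if the profile is unknown."""
    if set(allele_calls) != set(LOCI):
        raise ValueError("allele calls must cover exactly the seven scheme loci")
    # Order the calls by the scheme's locus order before the table lookup.
    profile = tuple(allele_calls[locus] for locus in LOCI)
    return PROFILES.get(profile)


calls = {"locA": 2, "locB": 1, "locC": 3, "locD": 1, "locE": 1, "locF": 4, "locG": 1}
print(assign_st(calls))
```

How the allele at each locus is called (from an assembly or directly from k-mers in the reads) is the part that differs between the two method families; this sketch covers only the final table lookup.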

    ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses

    Whole Genome Sequence (WGS) data from bacterial species is used for a variety of applications ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of individual isolates of individual microbial species poses a tremendous opportunity for discovery and hypothesis-generating research into ecology and evolution of these microorganisms. Limitations in the flexibility, scalability, and user-friendliness of existing pipelines for population-scale inquiry, however, restrict applications of systematic, population-scale approaches. Here, we present ProkEvo, an automated, scalable, reproducible, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: (1) Automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; (2) Use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; (3) Use of high-performance and high-throughput computational platforms; (4) Generation of hierarchical-based population structure analysis based on combinations of multi-locus and Bayesian statistical approaches for classification for ecological and epidemiological inquiries; (5) Association of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases with the hierarchically-related genotypic classifications; and (6) Production of pan-genome annotations and data compilation that can be utilized for downstream analysis such as identification of population-specific genomic signatures. The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes).
Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3 to 26 days. ProkEvo can be used with virtually any bacterial species, and the Pegasus WMS uniquely facilitates addition or removal of programs from the workflow or modification of options within them. To demonstrate versatility of the ProkEvo platform, we performed hierarchical-based population structure analyses on available genomes of three distinct pathogenic bacterial species as individual case studies. The specific case studies illustrate how hierarchical analyses of population structures, genotype frequencies, and distribution of specific gene functions can be integrated into an analysis. Collectively, our study shows that ProkEvo presents a practical, viable option for scalable, automated analyses of bacterial populations with direct applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.

    Heuristic and Hierarchical-Based Population Mining of Salmonella enterica Lineage I Pan-Genomes as a Platform to Enhance Food Safety

    The recent incorporation of bacterial whole-genome sequencing (WGS) into Public Health laboratories has enhanced foodborne outbreak detection and source attribution. As a result, large volumes of publicly available datasets can be used to study the biology of foodborne pathogen populations at an unprecedented scale. To demonstrate the application of a heuristic and agnostic hierarchical population structure guided pan-genome enrichment analysis (PANGEA), we used populations of S. enterica lineage I to achieve two main objectives: (i) show how hierarchical population inquiry at different scales of resolution can enhance ecological and epidemiological inquiries; and (ii) identify population-specific inferable traits that could provide selective advantages in food production environments. Publicly available WGS data were obtained from the NCBI database for three serovars of Salmonella enterica subsp. enterica lineage I (S. Typhimurium, S. Newport, and S. Infantis). Using the hierarchical genotypic classifications (Serovar, BAPS1, ST, cgMLST), datasets from each of the three serovars showed varying degrees of clonal structuring. When the accessory genome (PANGEA) was mapped onto these hierarchical structures, accessory loci could be linked with specific genotypes. A large heavy-metal resistance mobile element was found in the Monophasic ST34 lineage of S. Typhimurium, and laboratory testing showed that Monophasic isolates have on average a higher degree of copper resistance than the Biphasic ones. In S. Newport, an extra sugE gene copy was found among most isolates of the ST45 lineage, and laboratory testing of multiple isolates confirmed that isolates of S. Newport ST45 were on average less sensitive to the disinfectant cetylpyridinium chloride than non-ST45 isolates. Lastly, data-mining of the accessory genomic content of S. Infantis revealed two cryptic Ecotypes with distinct accessory genomic content and distinct ecological patterns.
Poultry appears to be the major reservoir for Ecotype 1, and temporal analysis further suggested a recent ecological succession, with Ecotype 2 apparently being displaced by Ecotype 1. Altogether, the use of a heuristic hierarchical-based population structure analysis that includes bacterial pan-genomes (core and accessory genomes) can (1) improve genomic resolution for mapping populations and accessing epidemiological patterns; and (2) define lineage-specific informative loci that may be associated with survival in the food chain.

    Salmonella enterica induces biogeography-specific changes in the gut microbiome of pigs

    Swine are a major reservoir of an array of zoonotic Salmonella enterica subsp. enterica lineage I serovars including Derby, Typhimurium, and 4,[5],12:i:- (a.k.a. Monophasic Typhimurium). In this study, we assessed the gastrointestinal (GI) microbiome composition of pigs in different intestinal compartments and the feces following infection with specific zoonotic serovars of S. enterica (S. Derby, S. Monophasic, and S. Typhimurium). 16S rRNA based microbiome analysis was performed to assess for GI microbiome changes in terms of diversity (alpha and beta), community structure and volatility, and specific taxa alterations across GI biogeography (small and large intestine, feces) and days post-infection (DPI) 2, 4, and 28; these results were compared to disease phenotypes measured as histopathological changes. As previously reported, only S. Monophasic and S. Typhimurium induced morphological alterations that marked an inflammatory milieu restricted to the large intestine in this experimental model. S. Typhimurium alone induced significant changes at the alpha- (Simpson’s and Shannon’s indexes) and beta-diversity levels, specifically at the peak of inflammation in the large intestine and feces. Increased community dispersion and volatility in colonic apex and fecal microbiomes were also noted for S. Typhimurium. All three Salmonella serovars altered community structure as measured by co-occurrence networks; this was most prominent at DPI 2 and 4 in colonic apex samples. At the genus taxonomic level, a diverse array of putative short-chain fatty acid (SCFA) producing bacteria were altered and often decreased during the peak of inflammation at DPI 2 and 4 within colonic apex and fecal samples. Among all putative SCFA producing bacteria, Prevotella showed a broad pattern of negative correlation with disease scores at the peak of inflammation. 
In addition, Prevotella 9 was found to be significantly reduced in all Salmonella infected groups compared to the control at DPI 4 in the colonic apex. In conclusion, this work further elucidates that distinct swine-related zoonotic serovars of S. enterica can induce both shared (high resilience) and unique (altered resistance) alterations in gut microbiome biogeography, which helps inform future investigations of dietary modifications aimed at increasing colonization resistance against Salmonella through GI microbiome alterations
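For readers unfamiliar with the alpha-diversity measures named in the abstract above, the following minimal sketch computes Shannon's index and one common form of Simpson's index (the Gini-Simpson form, 1 minus the sum of squared proportions) from per-taxon read counts. The function names and example counts are illustrative assumptions; the abstract does not say which variant or software the study used.

```python
import math
from typing import Sequence


def _proportions(counts: Sequence[int]) -> list:
    """Convert non-negative taxon counts to proportions, skipping empty taxa."""
    total = sum(counts)
    if total <= 0 or any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative and sum to a positive number")
    return [c / total for c in counts if c > 0]


def shannon_index(counts: Sequence[int]) -> float:
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    return -sum(p * math.log(p) for p in _proportions(counts))


def simpson_index(counts: Sequence[int]) -> float:
    """Gini-Simpson index 1 - sum(p_i ** 2); other Simpson variants exist."""
    return 1.0 - sum(p * p for p in _proportions(counts))


# Hypothetical counts for four equally abundant genera in one sample.
sample = [10, 10, 10, 10]
print(shannon_index(sample), simpson_index(sample))
```

Both indices rise as a community becomes richer and more even, so a drop in either is consistent with a community dominated by fewer taxa.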

    Addressing Bioinformatics Bottlenecks for Scalable Microbial Population Genomics Analyses

    With population genomics analyses, researchers can understand genetic relationships in populations and their environments, find genomic patterns, and for pathogenic organisms, especially microorganisms, track outbreaks and develop treatments with high accuracy. The process of doing population genomics starts with raw sequencing data and ends with genotypic mapping. However, to perform genomics analyses on a whole population scale, we need powerful computational platforms and efficient methods that work well with large data and generate accurate outcomes. In this dissertation, the two focal bottlenecks for performing efficient and accurate microbial population analyses are addressed: 1) the need for a scalable and effective computational platform that utilizes powerful computational resources; and 2) strategic algorithm selection for various steps of population genomics analyses by exploring three main applications: i) an automated and scalable multi-step bioinformatics pipeline; ii) accuracy of tools for read mapping; and iii) real-time sequence typing of foodborne pathogens. As part of this dissertation, we: 1) built ProkEvo, an automated and scalable platform for bacterial population genomics; 2) deployed ProkEvo on two different computational platforms; and 3) provided application case studies of ProkEvo. Next, we investigated the accuracy of mapping and alignment tools for long sequencing reads by: 4) building a consistent set of benchmarks using simulated data; 5) defining stringent assessment metrics; and 6) using a range of thresholds to reflect their true accuracy. Furthermore, we focused on real-time sequence typing of foodborne pathogens and: 7) performed a systematic and comprehensive comparison between assembly-dependent and assembly-free methods for scalable bacterial MLST mapping; 8) showed that the accuracy of these methods and the optimal k-mer length are species-specific; and 9) incorporated both methods in ProkEvo.
With the experiments performed in this dissertation, we provide useful guidelines for strategic algorithm selection for the steps of population genomics analyses. Moreover, we conclude that ProkEvo provides a practical and viable platform for scalable automated analyses of bacterial populations that can be applied in microbiology research, clinical diagnostics, and epidemiological surveillance.

    A Comparison of a Campus Cluster and Open Science Grid Platforms for Protein-Guided Assembly using Pegasus Workflow Management System

    Scientific workflows are a useful tool for managing large and complex computational tasks. Due to their intensive resource requirements, scientific workflows are often executed on distributed platforms, including campus clusters, grids and clouds. In this paper we build a scientific workflow for blast2cap3, the protein-guided assembly, using the Pegasus Workflow Management System (Pegasus WMS). The modularity of blast2cap3 allows us to decompose the existing serial approach into multiple tasks, some of which can be run in parallel. Afterwards, this workflow is deployed on two distributed execution platforms: Sandhills, the University of Nebraska Campus Cluster, and the Open Science Grid (OSG). We compare and evaluate the performance of the built workflow for both platforms. Furthermore, we also investigate the influence of the number of clusters of transcripts in the blast2cap3 workflow on the total running time. The performed experiments show that the Pegasus WMS implementation of blast2cap3 reduces the running time by more than 95% compared to the current serial implementation of blast2cap3. Although OSG provides more computational resources than Sandhills, our workflow experimental runs have better running time on Sandhills. Moreover, the selection of 300 clusters of transcripts gives the optimum performance with the resources allocated from Sandhills.
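The decomposition idea in the abstract above (split a serial job into independent tasks, run them in parallel, then combine the outputs) can be sketched in plain Python. This is not the Pegasus WMS API and not the blast2cap3 code: `split_clusters`, `merge_chunk`, and `run_parallel` are hypothetical names, and the per-chunk step is a placeholder that only joins transcript IDs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def split_clusters(clusters: Dict[str, List[str]], n_tasks: int) -> List[Dict[str, List[str]]]:
    """Distribute protein-keyed transcript clusters round-robin over n_tasks chunks."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    n_chunks = min(n_tasks, max(len(clusters), 1))
    chunks: List[Dict[str, List[str]]] = [{} for _ in range(n_chunks)]
    for i, (protein, transcripts) in enumerate(sorted(clusters.items())):
        chunks[i % n_chunks][protein] = transcripts
    return chunks


def merge_chunk(chunk: Dict[str, List[str]]) -> List[str]:
    """Placeholder for the per-cluster assembly step: join each cluster's transcript IDs."""
    return ["+".join(sorted(ids)) for _, ids in sorted(chunk.items())]


def run_parallel(clusters: Dict[str, List[str]], n_tasks: int) -> List[str]:
    """Process chunks independently in a thread pool, then combine the results."""
    chunks = split_clusters(clusters, n_tasks)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        parts = list(pool.map(merge_chunk, chunks))
    return sorted(item for part in parts for item in part)


# Hypothetical clusters: transcripts grouped by the protein they hit.
example = {"p1": ["t2", "t1"], "p2": ["t3"], "p3": ["t5", "t4"]}
print(run_parallel(example, n_tasks=2))
```

Because each chunk is processed without reference to the others, the number of chunks becomes a tunable parameter, which is the kind of setting the abstract reports exploring.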
